Abstract The task of the Emotion Recognition in the Wild
(EmotiW) Challenge is to assign one of seven emotions to
short video clips extracted from Hollywood-style movies.
The videos depict acted-out emotions under realistic
conditions with a large degree of variation in attributes
such as pose and illumination, making it worthwhile to
explore approaches which consider combinations of features
from multiple modalities for label assignment.

In this paper we present our approach to learning several
specialist models using deep learning techniques, each
focusing on one modality. Among these are a convolutional
neural network, focusing on capturing visual information in
detected faces, a deep belief net focusing on the
representation of the audio stream, a K-Means based
“bag-of-mouths” model, which extracts visual features
around the mouth region, and a relational autoencoder, which
addresses spatio-temporal aspects of videos.

We explore multiple methods for the combination of cues
from these modalities into one common classifier. This
achieves a considerably greater accuracy than predictions
from our strongest single-modality classifier. Our method
was the winning submission in the 2013 EmotiW challenge and
achieved a test set accuracy of 47.67% on the 2014 dataset.

Keywords Emotion recognition • Deep learning • 
Model combination • Multimodal learning 


1 Introduction 

This is an extended version of the paper describing our
winning submission [22] to the Emotion Recognition in the
Wild Challenge (EmotiW) in 2013 [11]. Here we describe our
approach in more detail and present results on the new data
set from the 2014 competition.

The task in this competition is to assign one of seven
emotion labels (angry, disgust, fear, happy, neutral, sad,
surprise) to each short video clip in the Acted Facial
Expression in the Wild (AFEW) dataset [12]. The video clips
are extracted from feature films. Given the low number of
samples per emotion category, it is difficult to deal with
the large variety of subjects, lighting conditions and poses
in these close-to-real-world videos. The clips are
approximately 1 to 2 seconds long and also feature an audio
track, which might contain voices and background music.

We explore different methods of combining predictions of
modality-specific models, including: (1) a deep
convolutional neural network (ConvNet) trained to recognize
facial expressions in single frames; (2) a deep belief net
that is trained on audio information; (3) a relational
autoencoder that learns spatio-temporal features, which help
to capture human actions; and (4) a shallow network that is
trained on visual features extracted around the mouth of the
primary human subject in the video. We discuss each model,
their performance characteristics and different aggregation
strategies. The best single model, without considering
combinations with other experts, is the ConvNet trained to
predict emotions given still frames. It has been trained
only on additional facial expression datasets, i.e. not
using the competition data. The ConvNet was then used to
extract class probabilities for the competition data. The
extracted probability vectors of the challenge training and
validation sets were aggregated to fixed-length vectors and
then used to train and validate hyperparameters of a support
vector machine (SVM) for final classification. This yielded
a test set accuracy of 35.58% for the 2013 dataset. Using
our best strategy (at the time) for the combination of top
performing expert models into a single predictor, we were
able to achieve an accuracy of 41.03% on the 2013 challenge
test set. The next best competitor achieved a test accuracy
of 35.89%. We reran our pipeline on the 2014 challenge data
with improved settings for our combination model and
achieved a test set accuracy of 47.67%, compared to 50.37%
reported by the challenge winners [30].


2 Related work 

The task of recognizing the emotion to associate with a
short video clip is well suited for methods and models that
combine features from different modalities. As such, many
other successful approaches in the Emotion Recognition in
the Wild (EmotiW) 2013 and 2014 challenges focus on the
fusion of modalities. These include [32], who used Multiple
Kernel Learning (MKL) for fusion of visual and audio
features. The recent success of deep learning methods in
challenging computer vision, language modeling [23] and
speech recognition [18] tasks seems to carry over to emotion
recognition, taking into account that the 2014 challenge
winners [30] also employed a deep convolutional neural net,
which they combined with other visual and audio features
using a Partial Least Squares (PLS) classifier. The adoption
of deep learning for visual features likely played a big
role in the considerable improvement compared to their
submission in the 2013 competition [29], although the first
and second runners up also reached quite good performances
without deep learning methods; [34] used a hierarchical
classifier for combining audio and video features and [7]
introduced an extension of Histogram of Oriented Gradients
(HOG) descriptors for spatio-temporal data, which they fuse
with other visual and audio features using MKL.


3 Models for modality-specific representation 
learning 

3.1 A convolutional network approach for faces 

ConvNets are artificial neural network architectures that
assume a topological input space, e.g. a 2D image plane. A
set of two-dimensional or three-dimensional (if the inputs
are color images) filters is applied to small regions over
the whole image using convolution, yielding a bank of filter
response maps (one map per filter), which also exhibit a
similar 2D topology.

To reduce the dimensionality of feature banks and to
introduce invariance with respect to slight translations of
the input image, convolutional layers are often followed by
a pooling layer, which subsamples the feature maps by
collapsing small regions into a single element (for instance
by choosing the maximum or mean value in the region).
ConvNets have recently been shown to achieve state of the
art performance in challenging object recognition tasks.
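The pooling step described above can be illustrated with a minimal NumPy sketch. It assumes non-overlapping square regions and a single 2D response map; the function name `pool2d` and the region size are illustrative choices, not details taken from our implementation.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Collapse non-overlapping size x size regions into one value each."""
    h, w = feature_map.shape
    # Drop trailing rows/columns that do not fill a complete region.
    h, w = h - h % size, w - w % size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

fmap = np.arange(16.0).reshape(4, 4)
max_pooled = pool2d(fmap, mode="max")
mean_pooled = pool2d(fmap, mode="mean")
```

Shifting the input by one pixel inside a region leaves the max-pooled value unchanged, which is the slight translation invariance referred to in the text.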

Because of the small number of training samples, our initial
experiments with ConvNets showed severe overfitting on the
training set, achieving an accuracy of 96.73% on the AFEW2
training set, compared to only 35.32% on the validation set.
For this reason we decided to train on a separate dataset,
which we refer to as ’extra data’. It consists of two face
image datasets and is described in Section 3.1.1.

The approach for the face modality can roughly be 
divided into four stages: 


1. Training the ConvNet on faces from extra data. The
architecture is described in Section 3.1.2.

2. Extraction of 7-class probabilities for each frame of
the facetubes (described in Section 3.1.3).

3. Aggregation of single frame probabilities into
fixed-length video descriptors for each video in the
competition dataset by expansion or contraction.

4. Classification of all video clips using a support vector
machine (SVM) trained on video descriptors of the
competition training set.


Stages three and four are described in detail in Section
3.1.4. The pipeline is depicted in Figure 1. The strategy of
training on extra data and using the competition data only
for classifier training and early stopping yielded a much
lower training set accuracy of 46.87%, but it achieved a
considerably better validation set accuracy of 38.96%.

3.1.1 Additional Face Dataset

The ’extra data’ we used for training of the deep network is
composed of two large static image datasets of facial
expressions for the seven emotion classes.

The first and larger one is the Google dataset, consisting
of 35,887 images with the seven facial expression classes:
angry, disgust, fear, happy, sad, surprise and neutral. The
dataset was built by harvesting images returned from
Google’s image search using keywords related to expressions,
then cleaned and labeled by hand. We use the grayscale
48 x 48 pixel versions of these images. The second one is
the Toronto Face Dataset (TFD) [35] containing 4,178 images
labeled with basic emotions, essentially with only fully
frontal facing poses.

To make the datasets compatible (there are big differences,
for instance variation among subjects, lighting and poses),
we applied the following registration and illumination
normalization strategies:

Registration To build a common dataset, TFD images and
frames from the competition dataset had to be integrated
with the Google dataset, for which we used the following
procedure: For image registration we used 51 of the 68
facial keypoints extracted by the mixture of trees method
from [40]. The face contour keypoints returned by this model
were ignored in the registration process. Images from the
Google dataset and the AFEW datasets have different poses,
but most faces are frontal views.

To reduce noise, the mean shape of frontal pose faces for
each dataset was used to compute the transformation between
the two shapes. For the transformation the Google data was
considered as base shape and the similarity transformation
was used to define the mapping. After inferring this
mapping, all data was mapped to the Google data. TFD images
have a tighter fit around faces, while Google data includes
a small border around the faces. To make the two datasets
compatible, we added a small noisy border to all images of
TFD.

Illumination normalization using isotropic smoothing To
compensate for varying illumination in the merged dataset,
we used a diffusion-based approach: the isotropic smoothing
(IS) function from the INface toolbox [33, 38], with the
default smoothness parameter and without normalization as
post-processing. A comparison of original and
IS-preprocessed face images is shown in Figure 2.

Figure 2 Raw images at the top and the corresponding
IS-preprocessed images below.

3.1.2 Extracting frame-wise emotion probabilities

Our ConvNet uses the C++ and CUDA implementation written by
Alex Krizhevsky [26] interfaced in Python. The network’s
architecture used here is presented in Figure 3. The ConvNet
takes batches of 48 x 48 images as input and performs a
random cropping into smaller 40 x 40 sub-images at each
epoch. These images are then randomly flipped horizontally
with a probability of 0.5. These two common methods allow us
to expand the limited training set and avoid over-fitting.
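The two augmentation steps can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the code used with the Krizhevsky implementation: the function names are hypothetical, images are assumed to be stored as an array of shape (batch, 48, 48), and the centre crop corresponds to what is used at test time.

```python
import numpy as np

def random_crop_and_flip(batch, crop=40, rng=None):
    """Random crop x crop window per image, then horizontal flip with p = 0.5."""
    rng = np.random.default_rng() if rng is None else rng
    n, h, w = batch.shape
    out = np.empty((n, crop, crop), dtype=batch.dtype)
    for i in range(n):
        top = rng.integers(0, h - crop + 1)    # upper bound is exclusive
        left = rng.integers(0, w - crop + 1)
        patch = batch[i, top:top + crop, left:left + crop]
        if rng.random() < 0.5:
            patch = patch[:, ::-1]             # mirror left-right
        out[i] = patch
    return out

def center_crop(batch, crop=40):
    """Deterministic crop from the image centre (no flipping)."""
    _, h, w = batch.shape
    top, left = (h - crop) // 2, (w - crop) // 2
    return batch[:, top:top + crop, left:left + crop]

faces = np.random.default_rng(0).random((8, 48, 48)).astype(np.float32)
train_batch = random_crop_and_flip(faces, rng=np.random.default_rng(1))
test_batch = center_crop(faces)
```

With a 48 x 48 input and a 40 x 40 window there are 9 x 9 crop positions and two flip states, so each training face can appear in up to 162 variants.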

The ConvNet architecture has 4 stages containing different
layers. The first two stages include a convolutional layer
followed by a pooling layer, then a local response
normalization layer [27]. The third stage includes only a
convolutional layer followed by a pooling layer. Max-pooling
is used in the first stage, while average-pooling is used in
the next stages. The last stage consists of seven softmax
units, which output seven probabilities, one for each of the
seven emotion labels. The activation function used in the
convolutional layers is the rectified linear unit (ReLU)
activation function. The first two convolutional layers use
64 filters each, and the last one 128, all of size 5 x 5
pixels. Each convolutional layer has the same learning
parameters: a 0.001 learning rate for the filters and 0.002
for biases, 0.9 momentum for both filters and biases and a
weight decay of 0.004 per epoch. The fully-connected layer
shares the same hyperparameters except for the weight decay,
which we set to 1. These hyperparameters are the same as
those provided by Krizhevsky [26] in his example layers
configuration files. The architecture is depicted in
Figure 3.

Classification at test time is done using the 40 x 40
sub-images cropped from the center of the original images.
We stopped learning at 453 epochs using early-stopping on
the competition validation and train sets. As stated
earlier, we only used extra data to train the network, and
the competition training and validation datasets were only
used for early stopping and the subsequent training of the
SVM.

A shallower ConvNet was explored for the 2013 competition.
It performed worse than ConvNet 1 and we did not revisit it
for the 2014 dataset. In the tables for the AFEW2 results,
it is referred to as ConvNet 2. For details on the
architecture see [22].

3.1.3 Facetube extraction procedure 

For the competition dataset video frames were extracted
preserving the original aspect ratio. Then the Google Picasa
face detector was used to crop detected faces in each frame.
To get the bounding box parameters in the original image, we
used Haar-like features for matching, because direct
pixel-to-pixel matching did not achieve the required
performance. Picasa did not detect faces in every frame. To
fix this, we searched the spatial neighborhood of the
temporally closest bounding box for regions with an
approximately matching histogram of color intensities. We
used heuristics, such as the relative positioning, sizes and
overlap, to associate bounding boxes of successive frames
and generate one facetube for each subject in the video.

For a few clips in the competition test sets, the Picasa
face detector did not detect any faces. So we used the
combined landmark placement and face detection method
described in [40] to find faces in these clips. Using the
facial keypoints output by that model we built bounding
boxes and assembled them into facetubes with the previously
described procedure.

Facetube smoothing In order to get image sequences where
face sizes vary gradually, we applied a smoothing procedure
on the competition facetube bounding boxes described in
Section 3.1.3. For all images of a facetube, coordinates of
the opposite corners of the bounding boxes were smoothed
with a two-sided moving average (using a window size of 11
frames). The largest centered squares that fit into these
smoothed bounding boxes yielded new bounding boxes which
more tightly frame the detected faces. To restrict the
amount of motion of the bounding boxes the same kind of
smoothing was also applied to the center of the bounding
boxes.
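The corner smoothing and square fitting can be sketched as follows. Two details are assumptions of this sketch and not stated in the text: boxes are stored as (x0, y0, x1, y1) rows, and near the start and end of a facetube the averaging window is simply truncated to the frames that exist.

```python
import numpy as np

def smooth_two_sided(values, window=11):
    """Centered moving average; the window shrinks at the sequence borders."""
    values = np.asarray(values, dtype=float)
    half = window // 2
    out = np.empty_like(values)
    for t in range(len(values)):
        lo, hi = max(0, t - half), min(len(values), t + half + 1)
        out[t] = values[lo:hi].mean(axis=0)
    return out

def largest_centered_square(box):
    """Largest square sharing the box centre that fits inside the box."""
    x0, y0, x1, y1 = box
    side = min(x1 - x0, y1 - y0)
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    return (cx - side / 2.0, cy - side / 2.0, cx + side / 2.0, cy + side / 2.0)

# Synthetic facetube: a box drifting right with a jittering right edge.
boxes = np.array([[10 + t, 20, 60 + t + (t % 2) * 4, 80] for t in range(30)],
                 dtype=float)
smoothed = smooth_two_sided(boxes)   # smooths all four corner coordinates
squares = np.array([largest_centered_square(b) for b in smoothed])
```

The same `smooth_two_sided` helper can be applied to the sequence of box centres to limit frame-to-frame motion.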

Side lengths of the bounding boxes can vary due to changes
of camera position or magnification (e.g. changing from a
medium shot to a close-up shot). To be able to handle this,
a further polynomial smoothing technique was applied
directly on the bounding box side lengths. Two low-order
polynomials of degree 0 (constant) and 1 (linear) were fit
through the side lengths of the bounding boxes. If the slope
of the linear polynomial is above a scale threshold
(slope · facetube length), we use the values of the linear
polynomial as side lengths, else we use values from the
constant smoothing polynomial. Empirically, we found that a
threshold of 1.5 yielded reasonable results.
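A possible reading of this rule is sketched below with `numpy.polyfit`. The exact comparison is an assumption of the sketch: it takes the absolute slope multiplied by the facetube length (in frames) and compares that product with the threshold of 1.5. The function name is hypothetical.

```python
import numpy as np

def smooth_side_lengths(sides, threshold=1.5):
    """Replace per-frame side lengths by a linear or a constant fit."""
    sides = np.asarray(sides, dtype=float)
    t = np.arange(len(sides))
    slope, intercept = np.polyfit(t, sides, 1)   # degree-1 least-squares fit
    # Assumed criterion: total change predicted by the line over the tube.
    if abs(slope) * len(sides) > threshold:
        return slope * t + intercept
    # The degree-0 least-squares fit is the mean side length.
    return np.full(len(sides), sides.mean())

zoom = smooth_side_lengths(np.linspace(40.0, 80.0, 25))          # zoom-in shot
static = smooth_side_lengths([50.0, 50.2, 49.9, 50.1, 49.8, 50.0])
```

A clip with a steady zoom keeps its linear trend, while a clip whose side lengths only jitter around a fixed value is flattened to a constant.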

The final facetubes were then generated by cropping based on
the smoothed bounding boxes and resizing the patches to
48 x 48. Per-frame emotion label probabilities were
extracted for each facetube using the ConvNet.


3.1.4 Aggregation into video descriptors and 
classification 

We aggregated the per-frame probabilities for all frames of
a facetube for which a face was detected into a fixed-length
video descriptor to be used as input to an SVM classifier.
For this aggregation step we concatenated the
seven-dimensional probability vectors of ten successive
frames, yielding 70-dimensional feature vectors. Most videos
have more than ten frames, some are too short, and there are
frames without detected faces. We resolved these problems
using the following two aggregation approaches:

- Video averaging: For videos that were too long, we
averaged the probability vectors of 10 independent groups of
frames taken uniformly along time, contracting the facetube
to fit into the 10-frame video descriptors. This is depicted
in Figure 4.

- Video expansion: For videos that contain too few frames
with detected faces, we expanded by repeating frames
uniformly to get 10 frames in total. This is depicted in
Figure 5.
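The two cases can be sketched as a single function. How the groups are formed and which frames are repeated is an assumption here (contiguous, near-equal groups via `numpy.array_split`, and evenly spaced repetition by integer indexing); the text only states that both are done uniformly along time.

```python
import numpy as np

def video_descriptor(frame_probs, n_slots=10):
    """Map a (n_frames, 7) array of class probabilities to a 70-dim vector."""
    frame_probs = np.asarray(frame_probs, dtype=float)
    n_frames = len(frame_probs)          # must be at least 1
    if n_frames >= n_slots:
        # Contraction: average within n_slots contiguous groups of frames.
        groups = np.array_split(frame_probs, n_slots)
        slots = np.stack([g.mean(axis=0) for g in groups])
    else:
        # Expansion: repeat frames so that n_slots rows are filled evenly.
        idx = (np.arange(n_slots) * n_frames) // n_slots
        slots = frame_probs[idx]
    return slots.reshape(-1)

rng = np.random.default_rng(0)
long_clip = rng.dirichlet(np.ones(7), size=37)    # 37 frames with a face
short_clip = rng.dirichlet(np.ones(7), size=4)    # only 4 frames with a face
long_desc = video_descriptor(long_clip)
short_desc = video_descriptor(short_clip)
```

Because each slot is either a probability vector or a mean of probability vectors, every group of seven values in the descriptor still sums to one.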

The video descriptors for the training set were then used to
train an SVM with a radial basis function (RBF) kernel. The
hyperparameters γ and C were tuned on the competition
validation set. The SVM type used in all experiments was a
C-SVM classifier and the outputs are probability estimates,
so that the fusion with other results was simpler.

Figure 3 The architecture of our ConvNet.

Figure 4 Frame aggregation via averaging

Figure 5 Frame aggregation via expansion

3.2 Audio & Deep Belief Networks

As we have described earlier, deep learning based techniques
have led to important successes in speech recognition. In
the context of emotion recognition on audio features
extracted from movie clips, we used a deep learning approach
for performing emotion recognition by pretraining a deep MLP
as a deep belief network (DBN) [19]. A DBN is a
probabilistic generative model where each layer can be
greedily trained as a Restricted Boltzmann Machine (RBM).
Initially we trained the network as a DBN in an unsupervised
manner with a greedy layerwise training procedure and then
we used supervised finetuning. In order to tune the
hyperparameters of our model, we performed cross-validation
using the competition validation dataset. We initially used
a random search for hyperparameters and after the random
search, we did manual finetuning of hyperparameters.

3.2.1 Audio Preprocessing

Choosing the right features is a crucial aspect of the audio
classification. Mel-frequency cepstral coefficients (MFCCs)
are widely used for speech recognition; however, in this
task we are mainly interested in detecting emotions from the
extracted audio features.

On the other hand emotion recognition on film audio is quite
different from other audio tasks. In addition to speech in
the audio track, background noise and the soundtrack can
also be significant indicators of emotion. For the EmotiW
challenge, we extracted 29 features from each audio track
using the Yaafe library¹ with a sampling rate of 48 kHz. We
used all features provided by the Yaafe library except
“Frames”. Additionally, 3 types of MFCC features are used:
the first used 22 cepstral coefficients, the second used a
feature transformation with the temporal first-order
derivative and the last one employed second-order temporal
derivatives. Online PCA was applied on the extracted
features, and 909 features per timescale were retained.

1 Yaafe: audio features extraction toolbox:
http://yaafe.sourceforge.net/

3.2.2 DBN Pretraining

We used unsupervised pre-training with deep belief networks
(DBN) on the extracted audio features. The DBN has three
layers of RBMs: the first layer is a Gaussian RBM with noisy
rectified linear unit (ReLU) nonlinearity, and the second
and third layers are both Gaussian-Bernoulli RBMs. We
trained the RBMs using stochastic maximum likelihood and
contrastive divergence with one Gibbs step (CD-1).

Each RBM layer had 350 hidden units. The first and second
layer RBMs were trained with learning rates of 0.0006,
0.0005 and 0.001 respectively. An L2 penalty of
$2 \times 10^{-3}$ and $2 \times 10^{-4}$ was used for the
first and second layer, respectively. Both the first and
second layer RBMs were trained for 15 epochs on the
competition training dataset. We bounded the noisy ReLU
activations of the first layer Gaussian RBM; specifically,
we used the activation function
$\min(a, \max(0, x + \psi))$, where
$\psi \sim \mathcal{N}(0, \sigma(x))$ with $a = 6$.
Otherwise large activations of the first layer RBM were
causing problems training the second layer
Gaussian-Bernoulli RBM. We used a Gaussian model of the form
$\mathcal{N}(0, \sigma(x))$, with zero mean and a standard
deviation of $\sigma(x) = \frac{1}{1 + \exp(-x)}$. At the
end of unsupervised pre-training, we initialized a
multilayer perceptron (MLP) with the ReLU non-linearity for
the first layer and sigmoid non-linearity for the second
layer using the weights and biases of the DBN.
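The bounded noisy ReLU can be written down directly from its definition. The sketch below is only an illustration of the activation function, with the logistic sigmoid taken as the standard deviation of the noise and an upper bound of 6; the function names are hypothetical and no RBM training is shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bounded_noisy_relu(x, a=6.0, rng=None):
    """min(a, max(0, x + noise)) with noise ~ N(0, sigmoid(x))."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    # Standard normal samples scaled by the input-dependent std sigmoid(x).
    noise = rng.normal(0.0, 1.0, size=x.shape) * sigmoid(x)
    return np.minimum(a, np.maximum(0.0, x + noise))

pre_activations = np.linspace(-10.0, 10.0, 101)
activations = bounded_noisy_relu(pre_activations, rng=np.random.default_rng(0))
```

For strongly negative inputs the noise scale approaches zero and the unit stays off, while large positive inputs saturate at the bound, which keeps the inputs to the next RBM in a fixed range.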


3.2.3 Temporal Pooling for Audio Classification 

We used a multi-time-scale learning model [16] for the MLP
where we pooled the last hidden representation layer of an
MLP so as to aggregate information across frames before a
final softmax layer. We experimented with various pooling
methods including max pooling and mean pooling, but we
obtained the best results with a specifically designed type
of pooling for the MLP features discussed below.

Assume that we have a matrix $A$ for the activations of the
MLP’s last layer features that includes activations of all
timescales in the clip, where
$A \in \mathbb{R}^{d_t \times d_f}$, $d_t$ is the variable
number of timescales and $d_f$ is the number of features at
each timescale. We sort the columns of $A$ in decreasing
order and get the top $N$ rows using the map
$f : \mathbb{R}^{d_t \times d_f} \to \mathbb{R}^{N \times d_f}$.
The most active $N$ features are summarized with a weighted
average of the top-$N$ features:

$$F = \frac{1}{N} \sum_{i=1}^{N} w_i \, f^{(i)}(A; N) \qquad (1)$$

where $f^{(i)}(A; N)$ is the $i$-th highest active feature
over time and the weights should satisfy
$\sum_{i=1}^{N} w_i = N$. During the supervised finetuning,
we feed the reduced features to the top level softmax and
backpropagate through this pooling function to the lower
layers. We only used the top 2 ($N = 2$) most active
features in the weighted average. Weights of the features
were not learned; they were chosen as $w_1 = 1.4$,
$w_2 = 0.6$ during training and $w_1 = 1.3$, $w_2 = 0.7$
during test time. This kind of feature pooling technique
worked best if the features are extracted from a bounded
nonlinearity such as sigmoid(.) or tanh(.).
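A forward-pass sketch of this pooling is given below. It assumes the activations of one clip are stored as a (timescales x features) NumPy array with at least N rows; the function name is hypothetical and the backward pass is not shown.

```python
import numpy as np

def top_n_weighted_pool(A, weights):
    """Weighted average of the N largest activations of each feature over time."""
    A = np.asarray(A, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = len(w)
    # Sort every column in decreasing order and keep the top n rows.
    top = np.sort(A, axis=0)[::-1][:n]
    return (w[:, None] * top).sum(axis=0) / n

# Three timescales, two features.
A = np.array([[0.1, 0.9],
              [0.8, 0.2],
              [0.5, 0.4]])
pooled_train = top_n_weighted_pool(A, weights=(1.4, 0.6))
pooled_test = top_n_weighted_pool(A, weights=(1.3, 0.7))
```

With equal weights this reduces to the mean of the top N activations, and with all weight on the first entry it reduces to max pooling, so the chosen weights sit between the two.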


3.2.4 Supervised Fine-tuning 


The competition training dataset was used for supervised
fine-tuning and we applied early stopping by measuring the
error on the competition validation dataset. The features
were centered prior to training. Before initiating the
supervised training, we shuffled the order of clips. During
the supervised fine-tuning phase, at each iteration on the
training dataset, we randomly shuffled the order of the
features in the clip as well. At each training iteration, we
randomly dropped out 98 clips from the training dataset and
we randomly dropped out 40% of the features in the clip.
0.121 % of the hidden units are dropped out and we used a
norm constraint on the weights such that the L2 norm of the
incoming weights to a hidden unit does not exceed 1.2875
[20]. In addition to drop-out and maximum norm constraint on
the weights, an L2 weight penalty with coefficient of
$10^{-5}$ was used. The RMSProp adaptive learning rate
algorithm was used to tune the learning rate with a
variation of Nesterov’s Momentum [36]. RMSProp scales down
parameter updates by a running average of the gradient norm.
At each iteration we keep track of the mean square of the
gradients by:

$$\mathrm{RMS}(\Delta_{t+1}) = \rho \, \mathrm{RMS}(\Delta_t) + (1 - \rho) \, \Delta_t^2 \qquad (2)$$

and compute the momentum, then do the stochastic gradient
descent (SGD) update:

$$v_{t+1} = \mu v_t - \epsilon_0 \frac{\partial f(x^{(i)}; \theta_t)}{\partial \theta_t} \qquad (3)$$

$$\theta_{t+1} = \theta_t + \frac{\mu v_{t+1} - \epsilon_0 \frac{\partial f(x^{(i)}; \theta_t)}{\partial \theta_t}}{\sqrt{\mathrm{RMS}(\Delta_{t+1})}} \qquad (4)$$

After performing crossvalidation, we decided to use
$\epsilon_0 = 0.0005$, $\mu = 0.46$ and $\rho = 0.92$. We
used early stopping based on the validation set performance,
yielding an accuracy of 32.90%. Once supervised fine-tuning
had completed 50 iterations, if the validation error
continued increasing, the learning rate was decreased by a
factor of 0.99.
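One parameter update following the RMSProp and Nesterov-style momentum equations of this section can be sketched as below. The gradient is treated as the quantity whose mean square is tracked; the small constant `tiny` inside the square root and the zero initialization of the running statistics are assumptions added for numerical safety, and the quadratic objective is only a toy example.

```python
import numpy as np

def rmsprop_nesterov_step(theta, v, rms, grad,
                          eps0=0.0005, mu=0.46, rho=0.92, tiny=1e-8):
    """One update of parameters theta given the current gradient."""
    rms = rho * rms + (1.0 - rho) * grad ** 2        # running mean square
    v = mu * v - eps0 * grad                         # momentum (velocity)
    # Look-ahead step, scaled by the root of the running mean square.
    theta = theta + (mu * v - eps0 * grad) / np.sqrt(rms + tiny)
    return theta, v, rms

# Toy problem: minimise f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta0 = np.array([1.0, -2.0])
theta, v, rms = theta0.copy(), np.zeros(2), np.zeros(2)
for _ in range(200):
    theta, v, rms = rmsprop_nesterov_step(theta, v, rms, grad=theta)
```

Because the step is divided by the root mean square of recent gradients, its size depends mostly on the learning rate and momentum rather than on the raw gradient magnitude.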


3.3 Activity recognition using a relational autoencoder 

Given a video sequence with the task of extracting human
emotion labels, it seems reasonable to also consider the
temporal evolution of image frames. To this end we employ an
activity recognition system for emotion recognition based on
local spatio-temporal feature computation. Using local
motion features for activity recognition is a popular
approach employed in many previous works [28, 24, 37, 39].

Figure 6 Subset of filters learned by SAE model on the AFEW2
training set. Left to right: Frames 1, 3, 5, 7 and 9.

Traditional motion energy models [1] encode spatio-temporal
features of successive video frames as sums of squared
quadrature Fourier or Gabor coefficients across multiple
frequencies and orientations [28]. Summing induces
invariance w.r.t. content, allowing the model to yield a
pure motion representation. In contrast to the motion energy
view, in [24] it has been shown that the learning of
transformations and introduction of invariance can be viewed
as two independent aspects of learning. Based on that view,
a single layered autoencoder based model named synchrony
autoencoder (SAE) for learning motion representations was
introduced. The classic approach is to use hand-engineered
features for spatio-temporal feature extraction [39]. In
contrast to hand-engineered features, deep learning based
methods have been shown to yield low-level motion features,
which generalize well across datasets.

We use a pipeline commonly employed in works on activity
recognition [28, 24, 39] with the SAE model for local motion
feature computation. We chose to use the SAE model because,
compared to other learning based methods like ISA [28] and
convGBM with complex learning rules, it can be trained very
efficiently, while performing competitively. The activity
recognition pipeline follows a bag-of-words approach. It
consists mainly of three modules: motion feature extraction,
K-means vector quantization and a χ² kernel SVM for
classification. The SAE model acts as feature extractor. It
is trained on small video blocks of size 10 x 16 x 16
(time x rows x columns) randomly cropped from the
competition training set. They are preprocessed using PCA
for whitening and dimensionality reduction, retaining 300
principal components. The number of randomly cropped
training samples is 200,000. The size of the SAE’s hidden
layer was fixed at 300. The model was trained using SGD with
a learning rate of 0.0001 and momentum 0.9 for 1,000 epochs.
The filters learned by the model on videos from the AFEW4
training set are visualized in Figure 6.

Past work has shown that spatially combining local features
learned from smaller input regions leads to better
representations than features learned on larger regions
[28, 8]. Here, we utilize the same method by computing local
feature descriptors for sub blocks cropped from the corners
of a larger 14 x 20 x 20 “super block” and concatenating
them, yielding a descriptor of motion for the region covered
by the super block. PCA was applied to this representation
for dimensionality reduction, retaining the first 100
principal components. To generate descriptors for a whole
video, super blocks are cropped densely for each video with
a stride of 7 on the temporal axis and 10 on the spatial
axes, i.e. with 50% overlap of neighboring super blocks. The
K-means clustering step produces a dictionary of 3000 words,
where each word represents a motion pattern. A normalized
histogram over K-means cluster assignment frequencies was
generated for each video as input to the classifier.
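The last step, turning the super block descriptors of one video into a normalized histogram of visual words, can be sketched as follows. The sketch assumes a dictionary of centroids has already been learned with K-means; the tiny two-word dictionary and the function name are purely illustrative (the dictionary in the text has 3000 words of dimension 100).

```python
import numpy as np

def bow_histogram(descriptors, centroids):
    """Normalized histogram of nearest-centroid assignments for one video."""
    # Squared Euclidean distance between every descriptor and every centroid.
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    counts = np.bincount(words, minlength=len(centroids)).astype(float)
    return counts / counts.sum()

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])       # toy dictionary
descriptors = np.array([[1.0, 0.0], [9.0, 9.0], [0.0, 1.0], [0.5, 0.5]])
hist = bow_histogram(descriptors, centroids)
```

Normalizing by the number of super blocks makes histograms of clips with different lengths and resolutions comparable before they are passed to the χ² kernel SVM.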

In our experiments we observed that the classifier trained
on the motion features seemed to overfit on the training
set, and none of the measures we investigated to avoid this
problem (e.g. augmenting the data set by randomly applying
affine transformations to the input videos) were helpful.
This could be due to the videos showing little to no motion
cues that correlate heavily with the emotion labels. The
motion model by itself is not very strong at discriminating
emotions, but it is useful in this task, nonetheless. It
helps to disambiguate cases where other modalities are not
very confident, because it represents some characteristics
of the data additional to those described by the other
modalities.


3.4 Bag of mouth features and shallow networks 


Some emotions may be recognized from mouth features. For
example, a smile often indicates happiness while an
“O”-shaped open mouth may signal surprise. For our
submission, facetubes, described in Section 3.1.3, at a
resolution of 96 x 96 were cropped around a region where the
mouth usually lies. This region was globally chosen by
visualizing many training images, but a more precise method,
such as mouth keypoint extraction [40], could also be
applied.

We mostly follow the method introduced by Coates et al. [8],
which achieved state-of-the-art performance on the CIFAR-10
dataset [25] in 2011, even though that method has since been
superseded by convolutional networks. As a first step, each
mouth image is divided into 16 equally sized sections, from
which many 8 x 8 patches are extracted. These are normalized
by individually setting the mean pixel intensity to 0 and
the variance to 1. After centering all patches from the same
spatial region, we apply whitening, which was shown to be
useful for this kind of approach [8], keeping 90% of the
variance. For each of the 16 regions, 400 centroids are
found by applying the k-means algorithm on the whitened
patches.
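The patch preprocessing for one region can be sketched as below. The text does not say which whitening variant was used; this sketch assumes PCA whitening, keeping the leading components that explain 90% of the variance, and adds small constants for numerical stability. Function names and the random stand-in patches are hypothetical.

```python
import numpy as np

def normalize_patches(patches, eps=1e-8):
    """Set each patch (row) to zero mean and unit variance."""
    patches = patches - patches.mean(axis=1, keepdims=True)
    return patches / np.sqrt(patches.var(axis=1, keepdims=True) + eps)

def fit_pca_whitening(patches, keep_variance=0.90, eps=1e-5):
    """Return the mean and a projection that whitens the retained components."""
    mean = patches.mean(axis=0)
    cov = np.cov(patches - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]                # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(explained, keep_variance) + 1)
    W = eigvecs[:, :k] / np.sqrt(eigvals[:k] + eps)  # scale each component
    return mean, W

rng = np.random.default_rng(0)
raw = rng.random((2000, 64))                         # stand-in for 8 x 8 patches
patches = normalize_patches(raw)
mean, W = fit_pca_whitening(patches)
whitened = (patches - mean) @ W
```

The whitened patches of each region would then be clustered with k-means to obtain that region's centroids.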

For any given image, patches are densely extracted from each
of the 16 regions and pre-processed as described above. Each
patch is assigned a 400-dimensional vector by comparing it
to the centroids with the triangle activation function [8],
where the Euclidean distance $z_k$ between the patch and
each centroid is computed, as well as the mean $\mu$ of
these distances. The activation of each feature is given by
$\max(0, \mu - z_k)$, so that only centroids closer than the
mean distance are assigned a positive value, while distant
ones stay at 0. As we have a 400-dimensional representation
for each patch, the image representation would become
extremely large if we simply concatenated all feature
vectors. For this reason, we pool over all features of a
region to get a local region descriptor. The region
descriptors are then concatenated to obtain a
6,400-dimensional representation of the image.
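The triangle activation and the per-region pooling can be sketched as follows, here with average pooling over the patches of a region. The two-centroid example is only for illustration (the text uses 400 centroids per region), and the function names are hypothetical.

```python
import numpy as np

def triangle_activation(patches, centroids):
    """max(0, mu - z_k): z_k is the distance to centroid k, mu the mean distance."""
    z = np.sqrt(((patches[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))
    mu = z.mean(axis=1, keepdims=True)       # mean distance per patch
    return np.maximum(0.0, mu - z)

def region_descriptor(patches, centroids):
    """Average the patch activations of one region into a single vector."""
    return triangle_activation(patches, centroids).mean(axis=0)

centroids = np.array([[0.0, 0.0], [3.0, 4.0]])
patches = np.array([[0.0, 0.0]])             # lies exactly on the first centroid
activation = triangle_activation(patches, centroids)
descriptor = region_descriptor(patches, centroids)
```

Concatenating the descriptors of all 16 regions, each with 400 entries, gives the 6,400-dimensional image representation mentioned above.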

This pooling generally uses the average activation of 
each feature, although we also tried taking the standard 
deviation across patches for each feature. A regularized 
logistic regression classifier is trained on a frame-by- 
frame basis with the pooled features as input. When 
classifying a test video, the predictions of the model 
are averaged over all its frames. 


4 Experimental results 

In Figure 7 (a-d) we show the validation set confusion
matrices from the models yielding the highest AFEW4
validation set accuracy for each of the techniques discussed
in Section 3. A second convolutional network for faces
(ConvNet 2), which we explored, is not presented here as it
obtained lower performance compared to ConvNet 1 and used
similar information to make its predictions. A more detailed
analysis of ConvNet 2 and comparisons on AFEW2 can be found
in [22], but we provide some highlights here.

AFEW2 From our experiments with AFEW2 we found
that ConvNet 1 yielded the highest validation set
accuracy. We therefore selected this model as our first
submission and it yielded a test set accuracy of 35.58%.
This is also indicated in table [1], which contains a
summary of all our submissions. ConvNet 2 was our second
highest performer, followed closely by the bag of mouth
and audio models, at 30.81%, 30.05% and 29.29%
respectively.


AFEW4 Here again our ConvNet 1 model achieved the
best results on the validation set for AFEW4. It was
followed by our audio model which here yields higher
performance than the bag of mouths model by a good
margin, at 34.20% and 27.42% accuracy respectively.

We explored the strategies outlined in Sections 4.1,
4.2 and 4.3 to combine models for the AFEW2
evaluation. Section 4.4 presents the strategy we used for
our experiments with the AFEW4.


4.1 Averaged Predictions - AFEW2 

A simple way to make a final prediction using several 
models is to take the average of their predictions. We 
had 5 models in total, which gives $\sum_{k=1}^{5} \binom{5}{k} = 31$
possible combinations (order has no importance). In this
context it is possible to test all combinations on the val¬ 
idation set to find those which are the most promising. 
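
The exhaustive test of all 31 averages can be sketched as below. This is
an illustration under assumptions: the model names, the dictionary layout
and the random stand-in predictions are hypothetical, and each model is
assumed to output a row of class probabilities per clip.

```python
from itertools import combinations
import numpy as np

def rank_averages(preds, labels):
    """Score the plain average of every non-empty subset of models."""
    names = sorted(preds)
    results = []
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            avg = np.mean([preds[m] for m in subset], axis=0)
            acc = float((avg.argmax(axis=1) == labels).mean())
            results.append((acc, subset))
    # Best validation accuracy first.
    results.sort(key=lambda r: r[0], reverse=True)
    return results

# Random stand-ins for validation labels and per-model predictions.
rng = np.random.default_rng(0)
n_clips, n_classes = 200, 7
labels = rng.integers(0, n_classes, size=n_clips)
model_names = ["activity", "audio", "bag_of_mouth", "convnet1", "convnet2"]
preds = {m: rng.dirichlet(np.ones(n_classes), size=n_clips) for m in model_names}
ranking = rank_averages(preds, labels)
print(len(ranking))  # 31
```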

Through this analysis we found that the average
of all models yielded the highest validation set
performance of 40.15% on AFEW2. The validation set
confusion matrix for this model is shown in figure [8] (a).
For our third 2013 submission we therefore submitted
the averaged predictions of all models, yielding 37.17%
on the test set. We also found that exactly the same
validation set performance was obtained with an average
that excluded our second convolutional network, leading
us to conclude that both convolutional networks were
providing similar information. We thus left it out of
subsequent strategies and experiments on the AFEW4.

The next highest performing simple average was
39.90% and consisted of simply combining ConvNet 1
and our audio model. Given this observation and the
fact that the conference baselines included video,
audio and combined audio-video models, we decided to
submit a model in which we used only these two models.
However, we first explored a more sophisticated way to
perform this combination.

Table 1 Our 7 submissions with training, validation and test accuracies for the EmotiW 2013 competition.

Sub.  Train  Valid  Test   Method
1     45.79  38.13  35.58  Google data & TFD used to train ConvNet 1, SVM trained on aggregated frame scores
2     71.84  42.17  38.46  ConvNet 1 (from submission 1) combined with Audio model using another SVM
3     97.11  40.15  37.17  Mean prediction from: Activity, Audio, Bag of mouth, ConvNet 1, ConvNet 2
4     98.68  43.69  32.69  SVM with detailed hyperparameter search: Activity, Audio, Bag of mouth, ConvNet 1
5     94.74  47.98  39.42  Short uniform random search: Activity, Audio, Bag of mouth, CN1, CN1 + Audio
6     94.74  48.48  40.06  Short local random search: Activity, Audio, Bag of mouth, CN1, CN1 + Audio
7     92.37  49.49  41.03  Moderate local random search: Activity, Audio, Bag of mouth, CN1, CN1 + Audio


Table 2 Our selected submissions with test accuracies for the EmotiW 2014 competition.

Sub.  Test   Method
1     39.80  Trained model on 2013 data, BoM failed due to different data format and replaced by uniform
2     37.84  Trained model on 2013 data, re-learning random search without failed BofM
3     44.71  ConvNet 1 + Audio model combined with SVM, all trained on train+valid
4     41.52  ConvNet 1 + Audio model combined with SVM trained on swapped predictions
5     37.35  Google data & TFD used to train ConvNet 1, frame scores aggregated with SVM
6     42.26  All models combined with SVM trained on validation predictions
7     44.72  All models combined with random search optimized on validation predictions
8     42.51  Only two models were trained on train+validation in combination, others used train set only
9     47.67  All models combined with random search optimized on full swapped predictions
10    45.45  Bagging of 350 models similar to submission 9

Figure 7 Confusion matrices for the AFEW4 validation
set. Accuracies for each method are specified in parentheses
(training, validation & test sets, if applicable). *Model has
been retrained on both training and validation set prior to
testing.

Figure 8 Confusion matrices on the test set of AFEW2
(a-b) and AFEW4 (c-d). Accuracies for each method are
specified in parentheses (training, validation & test sets, if
applicable). *Model has been retrained on both training and
validation set prior to testing. Panels: (c) ConvNet 1 &
Audio (-, -, 44.71*); (d) All modalities (submission 9)
(-, -, 47.67*).

4.2 SVM and MLP Aggregation Techniques - AFEW2

To further boost the performance of our combined
audio and video model we simply concatenated the results
of our ConvNet 1 and audio model into vectors and
learned an SVM with an RBF kernel using the challenge
training set. The hyperparameters of the SVM were set
via a two-stage coarse, then fine grid search over integer
powers of 10, then non-integer powers of 2 within the
reduced region of space. The hyperparameters
correspond to a kernel width term, $\gamma$, and the $C$
parameter of SVMs. This process yielded an accuracy of
42.17% on the validation set, which became our second
submission and produced a test accuracy of 38.46%.
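
The two-stage search can be illustrated with the sketch below. It is not
the original search code: the exponent range, the width of the fine
region (one decade around the coarse optimum) and the number of fine
steps are assumptions, and `toy_score` is a hypothetical stand-in for
"validation accuracy of an RBF SVM trained with these settings".

```python
import itertools
import math

def grid_search(score_fn, c_values, gamma_values):
    """Return (best_score, best_c, best_gamma) over a rectangular grid."""
    return max(
        (score_fn(c, g), c, g)
        for c, g in itertools.product(c_values, gamma_values)
    )

def coarse_to_fine(score_fn, coarse_exponents=range(-3, 4), fine_steps=9):
    # Stage 1: integer powers of 10 for both C and gamma.
    coarse = [10.0 ** e for e in coarse_exponents]
    _, c0, g0 = grid_search(score_fn, coarse, coarse)

    # Stage 2: non-integer powers of 2 spanning one decade centred on
    # the coarse optimum (log2(10) / 2 on each side).
    half = math.log2(10.0) / 2.0
    offsets = [-half + 2.0 * half * i / (fine_steps - 1) for i in range(fine_steps)]
    fine_c = [c0 * 2.0 ** o for o in offsets]
    fine_g = [g0 * 2.0 ** o for o in offsets]
    return grid_search(score_fn, fine_c, fine_g)

def toy_score(c, gamma):
    """Illustrative objective that peaks at C = 30, gamma = 0.02."""
    return -((math.log10(c) - math.log10(30.0)) ** 2
             + (math.log10(gamma) - math.log10(0.02)) ** 2)

score, best_c, best_gamma = coarse_to_fine(toy_score)
print(best_c, best_gamma)
```

In practice `score_fn` would train the SVM on the training set and
return its validation accuracy for the given pair of values.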

Given the success of our SVM combination strategy, 
we tried the same technique using the predictions of all 
models. However, this process quickly overfit the train¬ 
ing data and we were not able to produce any models 
that improved upon our best validation set accuracy 
obtained via the ConvNet 1 and audio model. We ob¬ 
served a similar effect using a strategy based upon an 
MLP to combine the results of all model predictions. 

We therefore tried a more sophisticated SVM
hyperparameter search to re-weight different models and
their predictions for different emotions. We implemented
this via a search over per-dimension scaling factors
discretized to the values [0,1,2,3]. While this resulted in
28 additional hyperparameters, this discretization
strategy allowed us to explore all combinations. This more
detailed hyperparameter tuning did allow us to increase
the validation set performance to 43.69%. This became
our fourth 2013 submission; however, the strategy yielded
a decreased test set performance of 32.69%.

4.3 Random Search for Weighting Models - AFEW2

Recent work [3] has shown that random search for
hyperparameter optimization can be an effective
strategy, even when the dimensionality of hyperparameters
is moderate (e.g. 35 dimensions). Analysis of our
validation set confusion matrices shows that different
models have very different performance characteristics
across the different emotion types. We therefore
formulated the re-weighting of per-model and per-emotion
predictions as a hyperparameter search over simplexes,
weighting the model predictions for each emotion type.
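
One way to write this weighting down is given below. The notation is
introduced here for illustration and is an assumption: $p_m(c \mid x)$
is the probability that model $m$ assigns to emotion $c$ for clip $x$,
$M$ is the number of models and $w_{m,c}$ are the weights being searched.

```latex
\hat{p}(c \mid x) = \sum_{m=1}^{M} w_{m,c}\, p_m(c \mid x),
\qquad w_{m,c} \ge 0,
\qquad \sum_{m=1}^{M} w_{m,c} = 1 \quad \text{for each emotion } c,
\qquad \hat{y}(x) = \arg\max_{c} \hat{p}(c \mid x).
```

Under this reading there is one simplex per emotion, and with $M = 5$
models and 7 emotions the search space has 35 weights.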

To perform the random search, we first sampled
random weights from a uniform distribution and then
normalized them to produce seven simplexes. This process
is slightly biased towards weights that are less extreme
compared to other well known procedures that are
capable of generating uniform values on simplexes. After
running this sampling procedure for a number of hours
we used the weighting that yielded the highest
validation set performance (47.98%) as our 5th 2013
submission. This yielded a test set accuracy of 39.42%. We
used the results of this initial random search to initiate
a second, local search procedure, which is analogous in a
sense to the typical two-level coarse, then fine grid
search used for SVMs. In this procedure we generated
random weights using a Gaussian distribution around
the best weights found so far. The weights were tested
by calculating the accuracy of the so-weighted average
predictions on the validation set. We also rounded
these random weights to 2 decimals to help avoid
overfitting on the validation set. This strategy yielded
40.06% test set accuracy with a short duration search
and 41.03% with a longer search, our best performing
2013 submission on the test set. The validation set
confusion matrix for this model is shown in figure [8] (b)
and the weights obtained through this process are shown
in figure [9] (a).
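
The two search stages described above can be sketched as follows. This
is an illustration rather than the original code: the function names,
iteration counts, the Gaussian width `sigma`, the renormalization before
rounding and the synthetic predictions are all assumptions.

```python
import numpy as np

def sample_simplex_weights(rng, n_models, n_classes):
    """Uniform draws normalized per class, giving one simplex per emotion."""
    w = rng.uniform(size=(n_models, n_classes))
    return w / w.sum(axis=0, keepdims=True)

def weighted_accuracy(w, preds, labels):
    """preds: (n_models, n_clips, n_classes); w: (n_models, n_classes)."""
    combined = (w[:, None, :] * preds).sum(axis=0)
    return float((combined.argmax(axis=1) == labels).mean())

def random_search(rng, preds, labels, n_iter=500):
    n_models, _, n_classes = preds.shape
    best_w, best_acc = None, -1.0
    for _ in range(n_iter):
        w = sample_simplex_weights(rng, n_models, n_classes)
        acc = weighted_accuracy(w, preds, labels)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc

def local_search(rng, preds, labels, w0, acc0, n_iter=500, sigma=0.05):
    best_w, best_acc = w0, acc0
    for _ in range(n_iter):
        # Gaussian perturbation around the best weights found so far,
        # clipped at zero, renormalized per class, rounded to 2 decimals.
        w = np.clip(best_w + rng.normal(scale=sigma, size=best_w.shape), 0.0, None)
        col = w.sum(axis=0, keepdims=True)
        if (col == 0).any():
            continue
        w = np.round(w / col, 2)
        acc = weighted_accuracy(w, preds, labels)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc

# Synthetic validation predictions: noisy scores with a mild bias
# towards the true label, normalized to sum to one per clip.
rng = np.random.default_rng(0)
n_models, n_clips, n_classes = 5, 300, 7
labels = rng.integers(0, n_classes, size=n_clips)
raw = 0.3 * np.eye(n_classes)[labels][None] + rng.uniform(size=(n_models, n_clips, n_classes))
preds = raw / raw.sum(axis=2, keepdims=True)

w_rand, acc_rand = random_search(rng, preds, labels)
w_best, acc_best = local_search(rng, preds, labels, w_rand, acc_rand)
```

Because the sub-model predictions are computed once and reused, each
candidate weighting costs only one weighted sum and one accuracy count.
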
Figure 9 Final weights used for model averaging in our best
submissions.


4.4 Strategies for the EmotiW 2014 Challenge and the
AFEW4 Data

While we did not participate in the EmotiW 2014
challenge, we have performed a sequence of experiments
using the underlying AFEW4 dataset and the training,
validation and test set partitions defined by the
challenge organizers. We have performed these experiments
after the challenge period so as to explore the behaviour
of our general technique as well as some different
training strategies arising from the fact that the challenge
is defined differently. Specifically, in the EmotiW 2014
challenge it is permitted to re-train all models using
the combined training and validation set if desired. We
correspondingly explored the following set of strategies.

As an initial step, we simply re-ran our best model
from the 2013 competition, without retraining it on the
2014 competition dataset. The predictions of the
Bag-of-mouth model were replaced by a uniform
distribution: this model was trained on faces provided by
the organizers, which were RGB in 2013 and grayscale in
2014, and this caused the model to fail on the new
dataset. Using models trained on AFEW2, we computed
predictions on the AFEW4 test set, which gave 39.80%
accuracy. The loss of roughly 1% relative to our best 2013
result (41.03%) could possibly be attributed to the
substitution of the Bag-of-mouth model with a uniform
distribution. However, a sound comparison with previous
results cannot be made as the AFEW2 and AFEW4 test sets
are different. Retraining the combination model on all
models trained on AFEW2 except bag-of-mouth resulted
in a lower 37.84% accuracy. We used a more aggressive
random search procedure by starting hundreds of
random searches with different initializations. The
generalization decrease from submission 1 to 2 was most
likely caused by overfitting because of this aggressive
random search. Nevertheless, as the AFEW4 training and
validation sets are larger than their AFEW2 counterparts,
models trained on the latter might not be competitive
in the EmotiW 2014 Challenge. Therefore, we trained
our models on AFEW4 data for submissions 3 to 10.

In preparation for the following sets of experiments,
all sub-models were trained on the training set alone,
on the validation set alone, and on the training set
combined with the validation set. This yields three
different sets of predictions from which one may explore
and compare different training and combination
strategies. Training on the training set and the validation
set separately allowed us to easily do 2-fold
cross-validation, while training on all data combined is a
commonly used strategy to exploit all available data but
can involve different techniques for setting model
hyperparameters.

4.4.1 An SVM combination approach using all data

One simple method for learning when working with a
single training set and a single validation set is to use
the training set to train a model multiple times with
different hyperparameters, then select the best model
using the validation set. One can then simply use these
hyperparameter settings and retrain the model using
the combined training and validation set. This method
is known to work well in practice.

We first used this method to train an SVM to
combine the predictions of the ConvNet 1 model and the
audio model. It resulted in 44.71% test accuracy, an
impressive 7% improvement over ConvNet 1 alone (37.35%)
and a 6% improvement over the same combination trained
only on the 2013 AFEW2 training set (38.26%). An
important factor might be that we are using predictions
on data not seen during sub-model training to train the
combination model. That is, they are less biased than
training predictions, which makes it possible for the
SVM to generalize better. The validation set alone is,
however, too small to train a good combination model.

To capitalize on this effect, we trained another SVM
on swapped predictions, i.e. the predictions on the
validation set came from sub-models trained on the
training set and the predictions on the training set came
from sub-models trained on the validation set. An SVM was
trained on both swapped sets separately to select the
best hyperparameters before training a final SVM on all
swapped predictions. With 41.52% test accuracy, this
model is worse than the previous one (44.71%). A possible
reason for this is that the training and validation sets
are unbalanced and relatively small. Good sub-models
trained on the larger training set tend to generate good
predictions on small validation sets, while worse
sub-models trained on the small validation set generate
worse predictions on the bigger training set. An obvious
solution would be to generate swapped predictions in a
manner similar to leave-one-out cross-validation; the
drawback is that for our setting we would need to train
5 times 900 models on each fold to generate the
predictions for the meta-model.
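
The construction of swapped predictions can be sketched as below. The
`NearestMean` class is a toy stand-in for a real sub-model and, like the
function name and the synthetic data, is an assumption made purely for
illustration.

```python
import numpy as np

class NearestMean:
    """Toy sub-model: class probabilities from distances to class means."""
    def fit(self, x, y, n_classes):
        self.means = np.stack([x[y == c].mean(axis=0) for c in range(n_classes)])
        return self

    def predict_proba(self, x):
        d = ((x[:, None, :] - self.means[None, :, :]) ** 2).sum(axis=2)
        e = np.exp(-(d - d.min(axis=1, keepdims=True)))
        return e / e.sum(axis=1, keepdims=True)

def swapped_predictions(x_train, y_train, x_valid, y_valid, n_classes):
    # Predictions on the validation set come from a model fit on training data.
    on_valid = NearestMean().fit(x_train, y_train, n_classes).predict_proba(x_valid)
    # Predictions on the training set come from a model fit on validation data.
    on_train = NearestMean().fit(x_valid, y_valid, n_classes).predict_proba(x_train)
    # Every row was predicted by a model that never saw that clip.
    meta_x = np.concatenate([on_train, on_valid])
    meta_y = np.concatenate([y_train, y_valid])
    return meta_x, meta_y

# Synthetic features with every class present in both partitions.
rng = np.random.default_rng(0)
n_classes, dim = 7, 10
centers = rng.normal(scale=3.0, size=(n_classes, dim))
y_train = np.arange(140) % n_classes
y_valid = np.arange(70) % n_classes
x_train = centers[y_train] + rng.normal(size=(140, dim))
x_valid = centers[y_valid] + rng.normal(size=(70, dim))
meta_x, meta_y = swapped_predictions(x_train, y_train, x_valid, y_valid, n_classes)
```

With several sub-models, the per-model outputs would be concatenated
column-wise to form the input of the combination model.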

Finally, similar to section 4.3, we trained the SVM
only on validation data. We hoped training an SVM
would yield results similar to running random search. It
did not. As explained in the next section, running random
search on the validation set predictions gives 44.72%
while training an SVM on the same data gives only 42.26%.

4.4.2 Weighting models and random search using all
data

A random search procedure for determining the
parameters of a linear per-class and per-model weighting
was computed as described in section 4.3 but for the
AFEW4 (EmotiW 2014 challenge data). For our first
experiment we ran a random search using the validation
set predictions, then used the resulting weights to
compute the weighted average of predictions of sub-models
trained on all data. To be clear, the only difference to
our best model from the 2013 submissions was that we
applied the weighted average on sub-models trained on
the combined training and validation set of the 2014
dataset. This yielded a test accuracy of 44.72%, 2%
higher than the same procedure with SVM training,
but no gain over the best combination of ConvNet 1
with audio models (44.71%).

Table 3 Test accuracies of different approaches on AFEW2
(left) and AFEW4 (right)

AFEW2                        AFEW4
Method            %          Method        %
MKL [32]          35.89%     PLS [30]      50.37%
PLS HO            34.61%     HCF EH        47.17%
Linear SVM [13]   29.81%     MKL [7]       45.21%
Our method [22]   41.03%     Our method    47.67%

Random search can also be applied to swapped
predictions such as those explained in the previous
section. Running random search on such predictions gave
our best results on AFEW4, 47.67%, slightly higher
than the first runner up in the EmotiW 2014
competition [34]. The weights found through this procedure
are shown in Figure [9] (b). A comparison of test
accuracies for both the 2013 and 2014 EmotiW datasets with
other methods is shown in table [3].

As some models were overfitting to the training data,
we tried to separate overfitters from the other models
and combine them together. We ran a random search
on the ConvNet 1, Bag-of-mouth and activity recognition
predictions on the validation data. Then we ran a second
random search on top of their weighted average with
our ConvNet 1 + Audio SVM combination of submission
3. This final weighted average was used to compute the
test predictions, giving only 42.51%.

Weights found by random search varied a lot from 
one run to another. We tried bagging of 350 indepen¬ 
dent weighted averages found by random searches sim¬ 
ilar to submission 9 (which obtained 47.67%). Surpris¬ 
ingly, the bagging approach achieved a lower accuracy 
of 45.45%, our second best result on AFEW4. 
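
A minimal sketch of such bagging is given below, assuming that the bag
simply averages the combined predictions of its members; the source does
not state the exact aggregation rule, and the random weightings here are
placeholders for weightings found by independent random searches.

```python
import numpy as np

def combine(w, preds):
    """Weighted average; preds: (n_models, n_clips, n_classes)."""
    return (w[:, None, :] * preds).sum(axis=0)

def bagged_prediction(weight_sets, preds):
    """Average the combined predictions of several weightings."""
    return np.mean([combine(w, preds) for w in weight_sets], axis=0)

rng = np.random.default_rng(0)
n_models, n_clips, n_classes, n_bags = 5, 50, 7, 350
preds = rng.dirichlet(np.ones(n_classes), size=(n_models, n_clips))

# Placeholder weightings, one simplex per emotion in each bag member.
weight_sets = []
for _ in range(n_bags):
    w = rng.uniform(size=(n_models, n_classes))
    weight_sets.append(w / w.sum(axis=0, keepdims=True))

bagged = bagged_prediction(weight_sets, preds)
```

Under this assumption the combination is linear in the weights, so the
bag is equivalent to a single weighting equal to the mean of the member
weights, which smooths out any one well-tuned weighting.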

5 Conclusions and discussion 

Our experiments with both competition datasets (2013
and 2014) have led to a number of contributions and
insights which we believe may be more broadly
applicable. First, we believe that our approach of using the
large scale mining of imagery from Google image search
to train our deep neural network has helped us to avoid
overfitting to the provided challenge dataset.

We achieved better performance when we used the
competition data exclusively for training the classifier
and used additional face image data for training of the
convolutional network. The validation set accuracy was
significantly higher than in our experiment in which we
trained the network directly on extracted faces from
the challenge data. It is our intuition that video frames
in isolation are not always representative of the
emotional tag assigned to the clip, and using one label for
the whole length of the video introduces noise to the
training set. In contrast, our additional data contained
only still images with a clear correspondence between
image and label. The problem of overfitting had both
direct consequences on per-model performance on the
validation set as well as indirect consequences on our
ability to combine model predictions. Our analysis of
simple model averaging showed that no combination of
models could yield superior performance to an SVM
applied to the outputs of our audio-video models. Our
efforts to create both SVM and MLP aggregation models
led to similar observations in that models quickly overfit
the training data and no settings of hyperparameters
could be found which would yield increased validation
set performance. We believe this is due to the fact that
the activity recognition and bag of mouth models severely
overfit the challenge training set and the SVM and MLP
aggregation techniques, being quite flexible, overfit the
data in such a way that no traditional hyperparameter
tuning could yield validation set performance gains.

These observations led us to develop the novel
technique of aggregating the per-model and per-class
predictions via random search over simple weighted
averages. The resulting aggregation technique is of
extremely low complexity and the underlying prediction is
highly constrained, using simple weighted combinations
of complex deep network models, each of which did
reasonably well at this task. We were therefore able to
explore many configurations in a space of moderate
dimensionality quite rapidly, as we did not need to
re-evaluate the predictions from the neural networks and
we did not adapt their parameters.

As this obtained a marked increase in performance on
both the challenge validation and test sets, it led us to
the following interpretation: given the presence of
models that overfit the training data, it may be better
practice to search a moderate space of simple combination
models. This is in contrast to traditional approaches
such as searching over the smaller space of SVM
hyperparameters or even a moderately sized space of
traditional MLP hyperparameters including the number of
hidden layers and the number of units per layer.